tl;dr

A method has been developed to calculate a single index for a build and platform, based upon the build's Raptor test results for the various run_suite and their corresponding metrics. The method handles potentially missing run_suite measurements.

The method involves the usage of a reference:

  1. Selecting a set of one or more reference builds. These could be based upon a reference date, or upon a build that corresponds to a specific release or other build characteristic.
  2. Geomeaning the loadtime and fcp of each sample (25 in total) for each run_suite.
  3. Selecting a set of reference pages for scoring. Criteria for selection can include most frequently used (i.e., greatest coverage in time), or greatest variation in the metrics (i.e., greatest coverage in metric space).
  4. Pooling the samples for each run_suite across all reference builds.
  5. Calculating the sample empirical cumulative distribution function (eCDF) for each run_suite.

Once the eCDFs have been calculated, scores can be calculated for every build:

  1. For each run_suite build sample, calculate its \(p_{run\_suite}\) from the relevant eCDF.
  2. If necessary, impute \(p_{run\_suite}\) for missing run_suite.
  3. Calculate the mean (\(p\)) and standard deviation (\(\sigma_{p}\)) of \(p_{run\_suite}\) from the 25 samples.
  4. Calculate the final score by taking the mean of \(p\) across run_suite.
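
The scoring steps above can be sketched as follows. The function and variable names are illustrative (not from the original pipeline), and the imputation step for missing run_suite is omitted:

```python
import numpy as np

def build_score(samples_by_suite, reference_by_suite):
    """Sketch of the scoring steps: map each run_suite's 25 geomean
    samples through that suite's reference eCDF, take the per-suite
    mean and standard deviation of the mapped p values, then average
    across run_suite for the final score."""
    p_means, p_sds = [], []
    for suite, x in samples_by_suite.items():
        ref = np.sort(np.asarray(reference_by_suite[suite], dtype=float))
        # eCDF value: fraction of pooled reference samples <= x
        p = np.searchsorted(ref, np.asarray(x, dtype=float), side="right") / len(ref)
        p_means.append(p.mean())
        p_sds.append(p.std())
    # Final score: mean of p across run_suite (0 to 1, lower is better)
    return float(np.mean(p_means)), float(np.mean(p_sds))
```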

The final score is a value from 0 to 1, with lower being better (e.g., decreased loadtime and/or fcp).

The following plot illustrates the resulting score, using the geomean of loadtime and fcp, across builds since February 1st for the Windows 10-64 platform.

Introduction

The following analysis focuses upon Raptor page load metrics from warm runs, on the Windows platform, for mozilla-central builds.

Raptor

A brief description of Raptor page load testing follows:

  • Every push to the repo (e.g. mozilla-central, mozilla-inbound, try) fires off the creation of a build.
  • Every new build fires off a set of run_suite page load tests for each platform (e.g., Windows 10-64).
    • A run_suite is a serialized snapshot of a web page (e.g., Amazon, Facebook) that is played back via Mitmproxy.
    • After loading the page once after startup, the page is loaded 25 times for Desktop warm tests.
  • Each run_suite test measures four metrics for each of the 25 samples: dcf, fcp, fnbpaint, loadtime

Therefore, the results for a build and platform are composed of n run_suite, 4 metrics, and 25 samples (\(\underset{n\times 4 \times 25}{\mathrm{X}}\)).

Build Score

A single score for a Firefox build should have two traits: (i) actionability, and (ii) interpretability. To produce a single score from Raptor page load tests requires multiple levels of aggregation:

  1. 4 metrics (dcf, fcp, fnbpaint, loadtime) into a single value
  2. 25 samples into a single value
  3. All run_suite into a single value

Run Suite Incompleteness

One characteristic of Raptor testing is that there are many different cases of run_suite incompleteness, where one or more run_suite are not tested for a specific build.

  • Errors in a specific page causing the tests to fail.
    • This can span across weeks of builds.
  • Specific pages not being tested.

These complicate aggregating the individual run_suite into a single value in a consistent manner. For this analysis, the 13 most common run_suite were chosen to minimize these issues. Builds that were missing one or more of these run_suite were dropped from the analysis: 474 of 569 builds (83%) had the complete set of run_suite.

The dropped and incomplete run_suite are apparent in the following figure. NOTE: The scales for each facet are independent, to illustrate the changes in these timings across builds/time.

Imputation

One method that can help resolve the dropped run_suite issue is imputing the missing values. This can be performed across a large range of historical data and can take into account the entire joint run_suite distribution. Such methods include matrix completion.
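
As an illustration, matrix completion can be sketched with a simple iterative low-rank SVD fill (a hard-impute style iteration). This is a minimal sketch for intuition, not the method used in the analysis:

```python
import numpy as np

def svd_impute(M, rank=2, n_iter=100):
    """Minimal matrix-completion sketch (hard-impute style): fill NaNs
    with column means, then alternate between a low-rank SVD
    approximation and restoring the observed entries."""
    missing = np.isnan(M)
    X = np.where(missing, np.nanmean(M, axis=0), M)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(missing, low_rank, M)  # keep observed entries fixed
    return X
```

Here `M` would be a builds × run_suite matrix of timings or scores, with NaN marking the missing run_suite.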

As seen in the figure above, several popular pages lose data on May 1st. The combined increased uncertainty of imputing multiple run_suite yields a potentially unreliable score in these cases. This remains a weakness of the method that needs to be addressed.

Clustering

Another method that can help with both issues is mapping the run_suite into a set of categories, each similar in its historical page load behavior. The aggregation across run_suite is then performed across categories, where run_suite within the same category are averaged. Therefore, each category has equal weight, and missing run_suite in a given category have their contribution come from the available run_suite in the same category.

This method still requires a minimum set of run_suite to be available, namely at least one of each category.
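
A minimal sketch of the category-based aggregation follows. The category assignment below is hypothetical, for illustration only; real categories would come from clustering historical page load behavior:

```python
import numpy as np

# Hypothetical category assignment, for illustration only.
CATEGORIES = {
    "raptor-tp6-amazon-firefox": "shopping",
    "raptor-tp6-ebay-firefox": "shopping",
    "raptor-tp6-bing-firefox": "search",
    "raptor-tp6-google-firefox": "search",
}

def category_score(p_by_suite, categories=CATEGORIES):
    """Average p within each category, then across categories, so each
    category carries equal weight; a missing run_suite is covered by
    the available run_suite in its category."""
    by_cat = {}
    for suite, p in p_by_suite.items():
        by_cat.setdefault(categories[suite], []).append(p)
    if len(by_cat) < len(set(categories.values())):
        raise ValueError("need at least one run_suite per category")
    return float(np.mean([np.mean(v) for v in by_cat.values()]))
```

Note that the `ValueError` enforces the minimum-availability requirement described above.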

Test Aggregation

Each run_suite contains four metrics: dcf, fcp, fnbpaint, loadtime. Currently, the mean of the 25 samples for each metric is calculated, and the results are geomeaned together. The following analysis compares these metrics across samples.

Correlation

The metrics are expected to be correlated, as they are measuring different aspects of the same process. Given the degree of correlation, some measures may be overly redundant and can be excluded from the score.

Immediately evident is that fcp and fnbpaint are almost perfectly correlated. Therefore, only one of them should be included in the aggregation. In addition, fcp/fnbpaint have a high degree of correlation with dcf. The lowest level of correlation is between fcp and loadtime.
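
The pairwise correlations can be computed directly with np.corrcoef. The data below is synthetic, constructed only to illustrate the computation (it mimics a single run_suite's 25 warm-run samples, with fcp and fnbpaint tracking each other closely by construction):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for one run_suite's 25 warm-run samples.
base = rng.lognormal(mean=6.0, sigma=0.1, size=25)
metrics = np.vstack([
    0.9 * base + rng.normal(0, 5, 25),   # dcf
    base + rng.normal(0, 1, 25),         # fcp
    base + rng.normal(0, 1, 25),         # fnbpaint
    3.0 * base + rng.normal(0, 60, 25),  # loadtime
])
r = np.corrcoef(metrics)  # 4x4 Pearson correlation matrix
```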

PCA

A common method for dimensionality reduction utilized in generating composite indices is PCA. One strength of PCA is that it is “data-driven”. However, interpretability of the final index can be lacking. The following performs PCA on the metrics loadtime, fcp, and dcf:

Almost 77% of the observed variance can be explained by a single dimension. Utilizing the first dimension as the aggregated metric score is possible. However, it isn’t easily interpretable, due to the differing contributions (weights) of each metric to this dimension.
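
The explained-variance computation can be sketched via SVD on standardized metrics. The function below is a generic sketch (the real analysis ran on the metric samples themselves):

```python
import numpy as np

def explained_variance_ratio(X):
    """PCA via SVD on standardized columns: center and scale each
    metric, then return the fraction of total variance explained
    by each principal component."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, s, _ = np.linalg.svd(Z, full_matrices=False)
    return s**2 / np.sum(s**2)
```

The first entry is the share of variance captured by the first principal component.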

The factor loadings for dcf and fcp are very similar. Therefore, we select loadtime and fcp as the two metrics to utilize in the metric aggregation step.

Geomean

The geomean of loadtime and fcp is used to reduce these metrics to a single value. The geomean has the nice properties of being easily interpretable and of weighting each metric equally. This yields a single value for each of the 25 samples.
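
The per-sample geomean is straightforward; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def geomean_loadtime_fcp(loadtime, fcp):
    """Geometric mean of the two metrics, per sample: equal weight to
    each metric, and one value for each of the 25 samples."""
    return np.sqrt(np.asarray(loadtime, dtype=float) * np.asarray(fcp, dtype=float))
```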

eCDF

To aggregate across run_suite metrics requires mapping them to a common scale. Page load timings can vary widely across different pages, leading to different scales, as seen in the plot above. The following method achieves a mapping by referencing against a set of builds. These reference builds can be determined in different ways, including:

  • A set around a given date.
  • A set around a given release build.

For the following analysis, we choose a set of five builds from the end of the observation window (end of May). As noted above, the 13 most common run_suite across the builds in the observation window were used.

For each run_suite, all of the reference build samples were combined into a single vector of 125 samples. Next, the eCDF was fit from these samples. This yielded an eCDF for each run_suite.

Finally, each geomean sample is mapped to \(p_{run\_suite}\) using the relevant eCDF. The following is an illustrative plot of the technique, where the blue points were used to fit the eCDF, and the orange points represent newly mapped samples.
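
Fitting the eCDF and mapping new samples can be sketched as follows (names illustrative):

```python
import numpy as np

def fit_ecdf(pooled_reference):
    """Fit an eCDF from the pooled reference samples for one run_suite
    (e.g., 5 reference builds x 25 samples = 125 values)."""
    ref = np.sort(np.asarray(pooled_reference, dtype=float))
    def ecdf(x):
        # Fraction of reference samples <= x
        return np.searchsorted(ref, x, side="right") / len(ref)
    return ecdf
```

New geomean samples are then mapped via `p = ecdf(samples)`.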

Sample Aggregation

After mapping the 25 samples to \(p_{run\_suite}\), the mean (\(p\)) and standard deviation (\(\sigma_{p}\)) are calculated. This yields two values for each run_suite for a given build and platform.

The following is a plot of the result for the end-of-May reference builds and the geomean of loadtime and fcp. The points represent builds, and the shaded regions represent one standard deviation.

Run Suite Aggregation

The final step is aggregating the \(p\) for each run_suite into a single value for the build and platform. This is achieved by taking the mean of \(p\) and of \(\sigma_{p}\) across run_suite.
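
This aggregation can be sketched as follows, assuming the 25 mapped \(p\) values per run_suite are already available (names illustrative):

```python
import numpy as np

def aggregate_run_suites(p_by_suite):
    """Mean and standard deviation of the 25 mapped p values per
    run_suite, then the mean of each across run_suite, giving the
    build's final score and its spread."""
    means = [np.mean(p) for p in p_by_suite.values()]
    sds = [np.std(p) for p in p_by_suite.values()]
    return float(np.mean(means)), float(np.mean(sds))
```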

Characteristics to note:

  • Large jump in the middle of February.
    • Observed across several run_suite in the timings.
    • Due to the Mitmproxy snapshots being updated on this date.
  • Small drop at the beginning of May.
    • Multiple run_suite have varying levels of decrease, whereas raptor-tp6-bing-firefox increases. The net effect is a decrease.
  • Increase in \(\sigma_{p}\) at beginning of May.

Geomean versus Load Time

The analysis above aggregates the metrics by utilizing a geomean. In practice, loadtime has been used for decision-making purposes. In this case, the geomean can be replaced with loadtime in the calculation of the build score. The following plot shows the differences between the geomean and loadtime scores:

Observations of the two scoring methods:

  • The large jump in the middle of February is more extreme for the Load Time * FCP geomean.
  • Smaller \(\sigma_{p}\) for Load Time * FCP.
  • Similar trend through the date of build.

Build Score Variation

There is still quite a bit of observed variation in this score across builds. This variation could be characterized and incorporated into the build score, by changing how \(p\) and \(\sigma_{p}\) are calculated and aggregated, to produce a more stable measure.

Additional Platforms

Next Steps

  • Apply scoring method to other build platforms.
  • Imputation of missing run_suite.
    • Determination of its effectiveness, by artificially dropping run_suite across the observed time ranges and recalculating the build score.
  • Investigation of aggregation methods of \(p_{run\_suite}\) to address the high levels of variation in the score.
  • Investigate statistical methods for weighting the pages.
    • e.g.: Clustering across metrics.
  • Aggregating across platforms to give a single score for an OS.
    • This requires experts to weight the relative importance of each platform.

Reference

The code that exported the data from ActiveData and prepared the dataset for this analysis is available here and here.